fix: remove panic() from all reconciliation and CLI paths by flemzord · Pull Request #477 · formancehq/operator

flemzord · 2026-06-09T20:18:00Z

Why

The project review (see review.md, finding C2/C3 and friends) identified 31 panic() call sites in production code. In a controller-runtime operator, any panic in a reconcile path, watch event handler, or shared helper crashes the whole controller-manager: a single malformed resource or transient API-server error takes down reconciliation for every stack in the cluster.

This PR removes every production panic(), 30 sites in total. The one remaining call site is in internal/tests/internal/bootstrap.go (test bootstrap, where failing fast is idiomatic and Ginkgo owns the process).

There is one commit per panic fix, so each change can be reviewed and reverted independently. Panics fixed by a single coherent change (e.g. the two embedded-FS reads inside one closure) share a commit.

Approach by panic class

| Class | Fix |
| --- | --- |
| Helper already in an error-returning context | Propagate the error (`return err`, wrapped where useful) |
| Helper without an error return (`HashFromConfigMaps`, `LowerCamelCaseKind`, `LowerCaseKind`, `GetPublisherEnvVars`, `HashFromHash`) | Signature changed to return `(T, error)`; all callers updated to propagate through their existing error paths |
| Watch map functions / event handlers (cannot return errors) | Log via `log.FromContext(ctx)` and drop the event; per-stack loops `continue` instead of dying so one bad stack doesn't lose the others' events |
| `init()`-time scheme/equality registration | `utilruntime.Must(...)`, the idiom already used in `cmd/main.go` |
| `URI.UnmarshalJSON` | Returns the parse error so only the offending object's decode fails |
| Dead code (`core.CopyDir`, no callers) | Removed |
| kubectl-stacks CLI | Errors flow through cobra's normal stderr + exit-code path; the deferred unlock failure in `upgrade` is joined into the command error via `errors.Join` |

Notable details

internal/resources/resourcereferences/init.go: the loop-invariant GVKForObject lookup was also hoisted out of the owner-references loop.
internal/resources/gatewayhttpapis/create.go: the kind was resolved twice for the same value; the second call now reuses the first result.
internal/resources/stacks/init.go: the condition-building closure now returns the conversion error after marking the module condition False ("Unable to read module status"), preserving the deferred condition write.
No behavior change for the happy path anywhere: hash computations, env-var construction and Caddy/benthos config generation produce byte-identical output.

Verification

go build ./... and go vet ./... pass on the main module and the tools/kubectl-stacks module (run after each commit for the touched package, plus a full pass at the end).
make test: all 8 Ginkgo envtest suites pass.

UnmarshalJSON runs while decoding user-provided custom resources: a single CR carrying an unparseable URL would crash the whole controller-manager. Returning the error fails only the decode of the offending object.
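
For illustration, the decode-time behaviour can be sketched with the standard library alone. The `URI` type below is a simplified, hypothetical stand-in (the field layout, the `decodeURI` helper and the error wording are assumptions, not the operator's actual code); it only shows that returning the parse error fails one decode rather than the process.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// URI is a hypothetical, reduced stand-in for the operator's URI type.
type URI struct {
	URL *url.URL
}

// UnmarshalJSON returns the parse error instead of panicking, so only the
// decode of the offending object fails.
func (u *URI) UnmarshalJSON(data []byte) error {
	var raw string
	if err := json.Unmarshal(data, &raw); err != nil {
		return fmt.Errorf("decoding URI: %w", err)
	}
	parsed, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parsing URI %q: %w", raw, err)
	}
	u.URL = parsed
	return nil
}

// decodeURI decodes a single JSON string into a URI (illustrative helper).
func decodeURI(doc string) (URI, error) {
	var u URI
	err := json.Unmarshal([]byte(doc), &u)
	return u, err
}

func main() {
	// A URL without a scheme before "://" is rejected by url.Parse.
	if _, err := decodeURI(`"://missing-scheme"`); err != nil {
		fmt.Println("decode failed:", err)
	}
	u, err := decodeURI(`"postgresql://db.example:5432/ledger"`)
	fmt.Println(u.URL.Host, err)
}
```

The caller decides what to do with the error; nothing outside the single decode is affected.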

Registering the URI semantic-equality function can only fail on a programmer error and must abort startup; utilruntime.Must is the established idiom for this (already used in cmd/main.go) and reports the failure through the k8s error handlers instead of a bare panic.

…s in WithWatchVersions A transient API-server error while listing stacks in the Versions watch handler would crash the whole controller-manager. Log the error and drop the event instead: the next Versions event re-triggers the mapping.

A kind-resolution failure (unregistered type) would crash the whole controller-manager from any module deployment helper. The function now returns the error; the eleven callers propagate it through their existing error paths, and the databases watch handler logs it and drops the event.
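
The signature change can be sketched as follows. This is a reduced illustration, not the operator's code: `registeredKinds` stands in for the scheme lookup, and `lowerCamelCaseKind`, `deploymentName` and the sample types are hypothetical.

```go
package main

import (
	"fmt"
	"reflect"
	"unicode"
)

// Sample object types; Unregistered is deliberately missing from the registry.
type Ledger struct{}
type BenthosStream struct{}
type Unregistered struct{}

// registeredKinds is a hypothetical stand-in for a scheme's type-to-kind lookup.
var registeredKinds = map[reflect.Type]string{
	reflect.TypeOf(Ledger{}):        "Ledger",
	reflect.TypeOf(BenthosStream{}): "BenthosStream",
}

// lowerCamelCaseKind returns (string, error) instead of panicking when the
// type is not registered.
func lowerCamelCaseKind(obj any) (string, error) {
	kind, ok := registeredKinds[reflect.TypeOf(obj)]
	if !ok || kind == "" {
		return "", fmt.Errorf("resolving kind: type %T is not registered", obj)
	}
	runes := []rune(kind)
	runes[0] = unicode.ToLower(runes[0])
	return string(runes), nil
}

// deploymentName is a hypothetical caller that propagates the failure through
// its existing error path.
func deploymentName(stack string, obj any) (string, error) {
	kind, err := lowerCamelCaseKind(obj)
	if err != nil {
		return "", fmt.Errorf("naming deployment for stack %s: %w", stack, err)
	}
	return stack + "-" + kind, nil
}

func main() {
	fmt.Println(deploymentName("dev", BenthosStream{}))
	fmt.Println(deploymentName("dev", Unregistered{}))
}
```

Only the reconciliation that hits the unregistered type sees an error; every other caller is unaffected.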

An unknown broker mode in a Broker status would crash the whole controller-manager from any module deployment helper. The function now returns an error that the five callers propagate through their existing error paths, failing only the offending reconciliation.

Converting a module's unstructured content to read its status would panic and crash the controller-manager on a single malformed module. The condition-building closure now returns the error (after marking the condition false) and the reconciliation fails normally.

…atchResource A kind-resolution error in the watch map function would crash the whole controller-manager. Log the error and drop the event instead. Also hoist the loop-invariant GVK lookup out of the owner-references loop.

A hashing failure would crash the whole controller-manager from any of the three callers (benthosstreams, auths, caddy). The function now returns the error and callers propagate it through their existing error paths.

…king The templates map was built in an immediately-invoked closure that panicked on embedded-FS read errors, crashing the controller-manager. Extract it into a buildTemplates helper that returns the error so the single Benthos reconciliation fails instead.
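
One possible shape for such a helper, shown as a sketch. The name `buildTemplates` is taken from the commit message, but its signature here is an assumption, and an in-memory `fstest.MapFS` (`demoFS`, with made-up file names) replaces the embedded FS so the example is self-contained.

```go
package main

import (
	"fmt"
	"io/fs"
	"testing/fstest"
)

// buildTemplates reads every named file from fsys and returns the first read
// error to the caller instead of panicking.
func buildTemplates(fsys fs.FS, names ...string) (map[string]string, error) {
	templates := make(map[string]string, len(names))
	for _, name := range names {
		data, err := fs.ReadFile(fsys, name)
		if err != nil {
			return nil, fmt.Errorf("reading template %s: %w", name, err)
		}
		templates[name] = string(data)
	}
	return templates, nil
}

// demoFS is a hypothetical in-memory file system standing in for the embed.FS.
var demoFS = fstest.MapFS{
	"templates/a.yaml": &fstest.MapFile{Data: []byte("input: {}")},
}

func main() {
	templates, err := buildTemplates(demoFS, "templates/a.yaml")
	fmt.Println(templates, err)

	_, err = buildTemplates(demoFS, "templates/missing.yaml")
	fmt.Println(err)
}
```

Taking an `fs.FS` rather than a concrete embedded FS also makes the failure path easy to exercise in tests.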

coderabbitai · 2026-06-09T20:18:09Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

flemzord added 28 commits June 9, 2026 22:00

fix(core): return error instead of panicking in WithOwner mutator

e0148cb

The ObjectMutator signature already propagates errors; panicking on SetOwnerReference failure would crash the whole controller-manager instead of failing the single reconciliation.

fix(core): propagate conversion error in GetAllStackDependencies

b8936d5

GetAllStackDependencies already returns an error; panicking on a FromUnstructured conversion failure would crash the controller-manager on a single malformed resource instead of failing one reconciliation.

fix(settings): propagate marshal error in GetAs

d08f5de

GetAs already returns an error; panicking on json.Marshal failure would crash the controller-manager instead of failing the single settings lookup.

fix(resourcereferences): propagate SetNestedMap error in Reconcile

28ee590

The CreateOrUpdate mutate closure already returns an error; panicking would crash the controller-manager instead of failing the single ResourceReference reconciliation.

fix(benthos): propagate config map hashing error

e1835ac

The enclosing reconcile helper already returns an error; panicking on a hashing failure would crash the controller-manager instead of failing the single Benthos reconciliation.

fix(client): use utilruntime.Must for scheme registration in init

6f5f0aa

Same idiom as cmd/main.go: scheme registration failure is a programmer error that must abort startup; utilruntime.Must reports it through the k8s error handlers instead of a bare panic.

fix(core): remove dead CopyDir helper

638445b

CopyDir has no callers outside its own recursion and contained two panic() calls on filesystem errors. Removing it eliminates both panic sites.

fix(core): log and drop event instead of panicking on kind resolution…

d114050

… in WithWatchVersions ObjectKinds failing for the watch target is a registration problem; crashing the controller-manager from an event handler hides the root cause. Log the error and drop the event instead.

fix(core): skip stack instead of panicking when listing modules in Wi…

aa9b68d

…thWatchVersions A transient list error for one stack would crash the whole controller-manager and lose the events of every other stack. Log the error and continue with the remaining stacks.

fix(core): log and drop event instead of panicking in Versions Update…

cf331c6

…Func Same rationale as the other WithWatchVersions handlers: an event handler must not crash the controller-manager on a kind-resolution error.

fix(core): make LowerCaseKind return an error instead of panicking

7e13e83

Same rationale as LowerCamelCaseKind: kind-resolution failure must not crash the controller-manager. Also reuse the already-computed object name in gatewayhttpapis.Create instead of resolving the kind twice.

fix(auths): return error from HashFromHash instead of panicking

118e902

createDeployment already has an error path; a hashing failure now fails the single Auth reconciliation instead of crashing the controller-manager.

fix(kubectl-stacks): return marshal error in enable instead of panicking

17fbb35

The function already returns an error; a CLI must report failures through its normal error path, not a panic stack trace.

fix(kubectl-stacks): return marshal error in disable instead of panic…

0ea8348

…king Same rationale as enable: report failures through the command's normal error path.

fix(kubectl-stacks): return marshal error in setDebug instead of pani…

aa8bff9

…cking Same rationale as enable: report failures through the command's normal error path.

fix(kubectl-stacks): return marshal error in lockStack instead of pan…

e9eb6a7

…icking Same rationale as enable: report failures through the command's normal error path.

fix(kubectl-stacks): return marshal error in unlockStack instead of p…

397b1f2

…anicking Same rationale as enable: report failures through the command's normal error path.

fix(kubectl-stacks): join deferred unlock error in upgrade instead of…

284c381

… panicking A failure to unlock stacks after an upgrade is exactly the situation the user must be told about cleanly; join it with the command error instead of panicking past the cobra error handling.

fix(kubectl-stacks): move scheme registration from init to main

b46b091

Register the scheme in main() and exit with a printed error instead of panicking from init(), keeping all CLI failures on the same stderr+exit-code path.

chore: refresh settings catalog

6b68f1f

flemzord wants to merge 29 commits into main from fix/remove-panics
